Attempting to quantify Gender differences using the Kaggle Developer Survey

One of the questions I try to answer in this notebook is whether there is gender inequality in the tech space. Assuming there is some, which seems likely given that a large share of survey respondents are male, how does it affect earning potential?

In this notebook I:

This plot shows the number of men and women in each category (degree). It does not tell us much, since we already know that this dataset is male dominated.

This plot shows the percentage of women that are in each category. Right off the bat we see that a lot of women have Master's degrees (42.7%) and this is closely followed by 34% in the Bachelor's degree category. Men have 39.7% and 36.2% in Master's and Bachelor's degree categories respectively.

In this plot, we see that a higher percentage of women than men have Master's and Doctoral degrees. Since the percentage of women with only a Bachelor's degree is lower, one possible reading is that women are more likely to continue their education after a Bachelor's degree.

In addition, a lower percentage of women have no formal education past high school, and a lower percentage started college without finishing.

This plot shows how much the proportion of women in each level of education is better/worse than the proportion of women in the sample. In an ideal scenario, the proportion of women for each level of education would be the same as the proportion of women in the sample but that is not the case.

The most obvious case is the No formal education past high school category. Here, the proportion of women is very small (about 13.6% below the proportion of women in the sample, 19.7%). This is a good sign, since it means that a higher percentage of women have at least some college education.
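The comparison described above can be sketched roughly as follows. This is a minimal illustration on made-up data, not the notebook's actual code: the column names `gender` and `education`, the label `"Woman"`, and the toy rows are all assumptions, and the deviation is expressed here in percentage points.

```python
import pandas as pd

# Hypothetical toy data; the real survey columns and labels may differ.
df = pd.DataFrame({
    "gender": ["Woman", "Man", "Man", "Man", "Woman",
               "Man", "Man", "Man", "Man", "Man"],
    "education": ["Master's", "Master's", "Bachelor's", "Bachelor's", "Bachelor's",
                  "High school", "High school", "Master's", "Bachelor's", "High school"],
})

is_woman = df["gender"].eq("Woman")

# Share of women in the whole sample (the baseline every category is compared to).
overall_share = is_woman.mean()

# Share of women within each education level.
share_by_level = is_woman.groupby(df["education"]).mean()

# Positive values: women are over-represented relative to the sample;
# negative values: under-represented. Expressed in percentage points.
deviation_pct = (share_by_level - overall_share) * 100

print(deviation_pct.sort_values())
```

Plotting `deviation_pct` as a horizontal bar chart would give a figure of the kind discussed here, with the zero line standing for the "ideal" case where each category mirrors the sample.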

Malaysia has the highest proportion of women, 19.5% above the proportion of women in the sample. Japan has the lowest proportion of women in this sample.

Surprisingly (for me at least!), Product/Project Managers have a very low proportion of women, while Statisticians have a fairly high proportion (probably as a result of all those degrees they bagged). This theory is supported by the fact that a higher proportion of women are students.

Notice how the roles at the tail end of the plot seem like the higher paying roles.

A higher proportion of women are in the early stages of their coding career, and the proportion of women keeps decreasing as years of coding experience increase. Since experience is a factor in earning potential, this is something to keep in mind.

As predicted, the plot above shows that in the lower income ranges the proportion of women is typically higher than the overall proportion of women in the sample.

As in the previous plot, the higher salary ranges typically have a lower proportion of women than the overall proportion. Although a few higher income levels appear on the left side of the plot, there are not enough instances in the dataset to call it a common trend.

We see that Data Scientists, Software Engineers and Research Scientists account for a significant share of respondents in these salary ranges.

As expected, respondents with more years of coding experience dominate these salary ranges.

As expected, degree holders are more common in these top salary ranges.


Building a Model

I thought it made more sense to use a regression here to try to predict salary. Although it will be very rough around the edges, I think converting the salaries from categorical to numeric will allow us to interpret the data more easily.
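One rough way to turn salary buckets into numbers is to take the midpoint of each range. The sketch below is only an assumption about how such a conversion could look: the helper name `salary_midpoint` and the example labels are hypothetical, and the notebook may have used a different rule (for instance the lower bound).

```python
import re

def salary_midpoint(label):
    """Map a salary-range label to a rough numeric value (hypothetical helper)."""
    # Pull out every number in the label, ignoring "$", ">" and thousands separators.
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", label)]
    if not numbers:
        return None  # label contains no number, so it cannot be converted
    if len(numbers) == 1:
        # Open-ended bucket such as "> $500,000": fall back to the stated bound.
        return float(numbers[0])
    low, high = numbers[0], numbers[1]
    return (low + high) / 2

# Example labels are assumptions about the survey's format.
for label in ["$0-999", "10,000-14,999", "> $500,000"]:
    print(label, "->", salary_midpoint(label))
```

Using midpoints keeps the ordering of the buckets but treats everyone in a bucket as earning the same amount, which is one reason the resulting regression is rough around the edges.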

I did not use the drop_first argument of pd.get_dummies because of the questions that are split into multiple columns. With drop_first, these columns get dropped altogether, since each contains only one value (apart from null values). So I have to drop the redundant dummy columns manually each time I want to use specific columns.
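The behaviour is easy to reproduce on a small frame. In the sketch below the column names (`Q4`, `Q7_Part_1`, `Q7_Part_2`) and values are made up to mimic a single-answer question and a multi-select question split across columns; they are assumptions, not the survey's real schema.

```python
import pandas as pd

# Hypothetical frame: Q4 is a single-answer question, Q7_Part_* is a
# multi-select question where each column holds one value or null.
df = pd.DataFrame({
    "Q4": ["Master's", "Bachelor's", "Doctoral"],
    "Q7_Part_1": ["Python", None, "Python"],
    "Q7_Part_2": [None, "R", "R"],
})

# With drop_first=True, a column with a single level loses its only dummy,
# so the multi-select parts vanish entirely.
dropped = pd.get_dummies(df, drop_first=True)

# Without drop_first, every level is kept ...
kept = pd.get_dummies(df)

# ... and a reference level is dropped by hand only where one is redundant.
kept_manual = kept.drop(columns=["Q4_Bachelor's"])

print(list(dropped.columns))
print(list(kept_manual.columns))
```

Which level to drop as the reference (here `Q4_Bachelor's`) is an arbitrary choice made for the illustration.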

Lasso regression, Random Forest


Next, we build models that do not include the survey participant's gender as a feature. This serves as an extra check to see whether the model still predicts higher salaries for males even though it was not told which rows belong to males.

Scatterplot showing the female predictions vs the male predictions. Hovering over the trend line we see that the formula for the straight line is:

                             female_predictor = 0.838546 * male_predictor - 2659.72 

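which shows that the female predictor predicts noticeably less than the male predictor: the slope is below 1 and the intercept is negative, so the gap grows as the predicted salary increases.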
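To get a feel for what the trend line implies, it can be evaluated at a few values. The coefficients below are copied from the formula above; the function name `female_from_male` and the sample salaries are illustrative assumptions, and the units are whatever the numeric salary conversion produced.

```python
def female_from_male(male_prediction):
    """Trend line from the scatterplot: female prediction for a given male prediction."""
    return 0.838546 * male_prediction - 2659.72

# Arbitrary example salaries, chosen only for illustration.
for male in (20_000, 50_000, 100_000):
    female = female_from_male(male)
    gap = male - female
    print(f"male={male:>7,}  female={female:>10,.2f}  gap={gap:>9,.2f}")
```

Because the gap equals 0.161454 times the male prediction plus 2659.72, it is never zero for positive salaries and widens as the male prediction grows.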

We see that the male predictions pop up frequently at the top of the scatterplots. This is an indication that the male predictor was predicting higher salaries more frequently.